# Required imports
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import string
import pickle
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_circles
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
#import joblib
import warnings

Machine Learning
Introduction to Neural Networks (updated version)
try:
    import tensorflow as tf
except ModuleNotFoundError:
    %pip install tensorflow

try:
    import torch
except ModuleNotFoundError:
    %pip install torch

try:
    import networkx as nx
except ModuleNotFoundError:
    %pip install networkx

from tensorflow import keras
import torch

Introduction to Artificial Neural Networks
- The artificial neuron takes a series of numerical values as inputs (the leftmost arrows entering the graph). Let us represent these inputs as \(s_1, s_2, \ldots, s_k\).
- Each input is multiplied by a certain weight \(w_1, w_2, \ldots, w_k\) associated with the arrow corresponding to that input and the products are used to form a linear combination: \[w_1s_1 + w_2s_2 + \cdots + w_ks_k\] This step is indicated in the diagram by the plus sign in a circle.
- A bias term \(w_0\) is added to the linear combination: \[w_0 + w_1s_1 + w_2s_2 + \cdots + w_ks_k\] To simplify the discussion and notation we will normally use the bias trick, adding an extra input with fixed value 1 so that the inputs are \(1, s_1, s_2, \ldots, s_k\) and then the bias can be included into the list of weights.
- Finally, the linear combination goes into an activation function \(h\), represented by the step function in the rightmost node of the diagram. We will later discuss in detail what activation functions are used in practice, but the resulting value, the output of the artificial neuron, will always be \[h(w_0 + w_1s_1 + w_2s_2 + \cdots + w_ks_k)\]
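Putting the three steps together, a single neuron is just a dot product, a bias, and an activation. A minimal sketch (the step activation and the names used here are illustrative, not from any library):

```python
import numpy as np

def step(z):
    # the step activation shown in the diagram
    return 1.0 if z >= 0 else 0.0

def neuron(s, w, w0):
    # output h(w0 + w1*s1 + ... + wk*sk)
    return step(w0 + np.dot(w, s))

out = neuron(s=np.array([1.0, -2.0]), w=np.array([0.5, 0.25]), w0=0.1)
# 0.1 + 0.5*1.0 + 0.25*(-2.0) = 0.1 >= 0, so out == 1.0
```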
Artificial neurons were introduced in the 1940s (see (Fitch 1944)) and captured the interest of the research community up to the late 1960s. They were conceived in an attempt to provide an abstract model of the operations performed by the neurons in animal brains. Of course, a biological neuron is a far more sophisticated system with a complex regulation composed of chemical and electrical signals. To say that the artificial neuron is a simplistic model of the biological reality is a huge understatement. But the analogy stuck, and up to the present these computational units are still referred to as artificial neurons. You can learn more about the first years of artificial intelligence in this video and in the book (Hecht-Nielsen 1990). See also the first pages of the lecture notes from last year and the references provided therein.
As you can see, the output of any neuron in the network can be used as the input to one or several other neurons in the network, downstream in the direction indicated by the arrows connecting the neurons. Note also some simplifications that are common in this type of ANN diagram:
- We do not include the weights associated with each input. Every arrow entering a neuron is assumed to carry an implicit weight along with it.
- The situation for the bias is even more extreme: the bias terms do not appear, not even as an arrow implying their presence. But they are there and you need to keep in mind that every neuron includes its own bias term.
There is in principle no limitation to the topology or architecture of the ANN; that is, the number and ways in which neurons are connected to each other. However, already in the early days of AI it was realized that sticking to some particular families of ANN architectures greatly simplified their understanding and their use.
A multilayer perceptron network has:
- One input layer, which simply contains the (vector of) values of the input data set.
- One or more hidden layers of neurons. In modern ANNs there are a number of different types of hidden layers. But for the multilayer perceptron all the hidden layers are of the same type. They are dense (or fully connected) layers, meaning that each neuron in layer \(i\) is connected with every neuron in layer \(i+1\).
- And an output layer. The structure of the output layer depends on the nature of the problem that the network is intended to solve. For a regression problem the output layer has a single neuron, which returns the predicted output value for the inputs we have used. In a binary classification problem the output layer will again have a single neuron, but the value is now interpreted as the score of the positive class, as in other classification algorithms we have seen. In a multiclass problem the output layer will have more neurons, each one producing the score corresponding to a given class (level) of the output factor.
We are going to discuss in detail all of these components below.
The multilayer perceptron is a particular type of feedforward neural network. The term feedforward indicates that the information flows in one direction, from the input to the output (left to right in the above diagram), without any feedback loops or recurrent connections. Multilayer perceptrons are historically the first examples of neural networks. But there are other important members of the feedforward family, such as the radial basis function (RBF) networks, which are related to the Gaussian mixture models we have seen. Convolutional neural networks (CNNs) are technically feedforward networks as well.
Sigmoid Activation Function
This is a function that we already know from our work with logistic regression: \[\sigma(x) = \dfrac{1}{1 + e^{-x}}\] Let us use Python to plot it:
x = np.linspace(-15, 15, 100)
fig, ax = plt.subplots(figsize=(8, 3))
sns.lineplot(x=x, y=1/(1 + np.exp(-x)))
plt.show()
Hyperbolic Tangent Activation Function
This function is similar in shape to the logistic curve, but it constrains the values to be in the \([-1, 1]\) interval (instead of the \([0, 1]\) interval of the logistic): \[\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}\]
fig, ax = plt.subplots(figsize=(8, 3))
sns.lineplot(x=x, y=np.tanh(x))
plt.show()
RELU (Rectified Linear) Activation Function
Probably the most popular activation function nowadays, the RELU function is defined as: \[ RELU(x) = \begin{cases} x &\text{ if }x \geq 0\\[3mm] 0 &\text{ if }x < 0\\ \end{cases} \] which means that the plot is:
def relu(x):
    return np.maximum(0, x)

fig, ax = plt.subplots(figsize=(8, 3))
sns.lineplot(x=x, y=relu(x))
plt.show()
As you can see this function is not linear, but piecewise linear. However, this very basic kind of nonlinearity is enough to prevent the network from collapsing into a purely linear model. On the other hand, computing the RELU value amounts to a simple if statement, whereas the previous activation functions require exponentials and are computationally much more expensive. This is one of the reasons why RELU has become so popular in recent years.
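The need for the nonlinearity can be checked directly: without an activation between them, two dense layers compose into a single linear map, so the extra layer adds nothing. A small numpy sketch (the matrices here are arbitrary, for illustration only):

```python
import numpy as np

# Without a nonlinearity the two layers collapse:
# W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(two_layers, one_layer)

# Inserting a RELU in between breaks this identity,
# and the network is no longer a linear model:
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
```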
Multilayer Perceptron Examples in scikit-learn
Let us begin by recreating the example dataset. To make things more interesting we select an example where a linear boundary is clearly not sufficient.
X, Y = make_circles(n_samples=1000, factor=.2, noise=0.15, random_state=2024)
inputs = ["X" + str(k) for k in range(X.shape[1])]
output = "Y"
XTR, XTS, YTR, YTS = train_test_split(X, Y,
                                      test_size=0.2,   # percentage preserved as test data
                                      random_state=1,  # seed for replication
                                      stratify=Y)      # preserves distribution of Y
dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS
fig, ax = plt.subplots(figsize=(5, 5))
sns.scatterplot(dfTR, x=inputs[0], y=inputs[1], hue=output, ax=ax)
plt.show()

We are going to use these datasets below, so we will make a backup copy to disk (in Python’s pickle format).
make_circles_data = {'XTR':XTR, 'XTS':XTS, 'YTR':YTR, 'YTS':YTS,
                     'inputs':inputs, 'output':output}
with open('make_circles_data.pkl', 'wb') as file:
    pickle.dump(make_circles_data, file)

Next we use the MLPClassifier class in sklearn to train an MLP. To define the model we need to describe the architecture of the neural network. In the case of MLPs, this amounts to defining the number of hidden layers and the number of neurons they contain. For this example we are going to use two hidden layers, and we describe them by using a tuple with the number of neurons in each layer.
Then we also need to select the activation function used by those hidden layers. The MLP implementation in sklearn only allows for choosing a fixed type of activation, which will be applied to all hidden layers. The activation for the output layer is then automatically selected depending on the output variable (a sigmoid for this binary classification problem).
Another important parameter is the number of fitting iterations. We have not discussed yet how we fit the neural network, but it should come as no surprise that this is an iterative process. If the default number of iterations is not enough to deliver convergence, we try to increase it.
# Create and train the MLPClassifier
mlp_binary = MLPClassifier(hidden_layer_sizes=(8, 6), activation="relu", max_iter=2000, random_state=2024)
mlp_binary.fit(XTR, YTR)
model = mlp_binary
model_name = "mlp_binary"

Let us also save the model as a pickle object.
make_circles_model = {'mlp_binary':mlp_binary}
with open('make_circles_model.pkl', 'wb') as file:
    pickle.dump(make_circles_model, file)

The fitted model object contains information about the MLP weights and bias terms in the coefs_ and intercepts_ properties. The coefs_ property is a list of arrays, describing the weights that connect each layer to the preceding one. We can inspect their shapes with

[lyr.shape for lyr in model.coefs_]
[(2, 8), (8, 6), (6, 1)]
The first describes the 16 weights connecting the two inputs (point coordinates) to each of the eight neurons of the first hidden layer. Then the 48 weights connecting the two hidden layers are contained in
model.coefs_[1]
array([[-4.51589347e-09, 3.09568370e-01, 3.93462057e-01,
1.07580913e-01, -5.35855889e-17, -7.52727725e-17],
[-3.08069968e-09, 3.68667143e-08, -1.89023491e-07,
4.14136197e-25, 7.98146481e-05, -2.12334178e-09],
[ 2.91508030e-01, 7.22150132e-01, -4.92246382e-01,
-1.49730891e+00, 3.59313307e-25, 1.43021995e-01],
[ 4.18481158e-02, 1.59917234e+00, -6.17295775e-01,
-1.66874272e+00, 3.20263930e-43, 9.57651428e-01],
[-3.90482357e-01, -1.17389485e+00, 1.03347948e+00,
1.83352135e+00, -3.24363017e-39, 1.98997800e+00],
[-3.41442364e-01, -2.07466443e+00, 2.49090921e+00,
1.93033548e+00, -1.31789459e-05, 1.90175454e+00],
[-2.58749913e-01, 5.70928544e-01, 8.56381933e-01,
1.51077681e+00, 2.58794997e-11, -1.61048936e+00],
[ 6.59127482e-31, 3.13993781e-10, 1.25264721e-05,
-9.74042812e-02, -2.30775935e-42, 2.24534395e-05]])
And finally there are six weights connecting the neurons of the second hidden layer to the single neuron in the output layer (for this binary classification problem).
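As a sanity check on these shapes, we can count the parameters of this architecture by hand. A small helper (hypothetical, not part of sklearn) that reproduces the counts for the layer sizes (2, 8, 6, 1) used above:

```python
def n_params(layer_sizes):
    # weights connect consecutive layers; one bias per non-input neuron
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

n_params([2, 8, 6, 1])  # 16 + 48 + 6 weights plus 8 + 6 + 1 biases = 85
```

This matches what the fitted model stores: the weight counts come from coefs_ and the bias counts from intercepts_.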
%run -i mlp_draw.py

Now we only need to feed the fitted model to the mlp_draw function. The orientation, figsize, label_pos and label_size arguments allow some customization of the result. Check the architecture of the network in the plot, and see if you can recognize some of the weights that we saw before.
mlp_draw(model=model, orientation="h", label_pos=2.3/7, add_labels=False)

Another interesting visualization is the decision boundary of this binary classifier, which clearly shows that it has learned a nonlinear solution.
disp = DecisionBoundaryDisplay.from_estimator(model, XTR, response_method="predict", cmap="Greens", alpha=0.5)
sns.scatterplot(dfTR, x=inputs[0], y=inputs[1], hue=output)
plt.show()

Let us now evaluate the classifier performance, using our standard toolset:
# Dataset for training predictions
dfTR_eval = dfTR.copy()
# Store the actual predictions
newCol = 'Y_' + model_name + '_prob_neg'
dfTR_eval[newCol] = model.predict_proba(XTR)[:, 0]
newCol = 'Y_' + model_name + '_prob_pos'
dfTR_eval[newCol] = model.predict_proba(XTR)[:, 1]
newCol = 'Y_' + model_name + '_pred'
dfTR_eval[newCol] = model.predict(XTR)

# Dataset for test predictions
dfTS_eval = dfTS.copy()
newCol = 'Y_' + model_name + '_prob_neg'
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 0]
newCol = 'Y_' + model_name + '_prob_pos'
dfTS_eval[newCol] = model.predict_proba(XTS)[:, 1]
newCol = 'Y_' + model_name + '_pred'
dfTS_eval[newCol] = model.predict(XTS)

This generates the confusion matrices and shows that classification is almost perfect both in training and test.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

fig = plt.figure(constrained_layout=True, figsize=(6, 2))
spec = fig.add_gridspec(1, 3)
ax1 = fig.add_subplot(spec[0, 0]); ax1.set_title('Training'); ax1.grid(False)
ax2 = fig.add_subplot(spec[0, 2]); ax2.set_title('Test'); ax2.grid(False)
ConfusionMatrixDisplay.from_estimator(model, XTR, YTR, cmap="Greens", colorbar=False, ax=ax1, labels=[1, 0])
ConfusionMatrixDisplay.from_estimator(model, XTS, YTS, cmap="Greens", colorbar=False, ax=ax2, labels=[1, 0])
plt.suptitle("Confusion Matrices for " + model_name)
plt.show()

The ROC curves tell a similar story. We satisfy ourselves that this looks good enough, and will not spend more time on performance measures. Feel free to dig deeper; you have all the tools from previous sessions.
from sklearn.metrics import RocCurveDisplay

fig = plt.figure(figsize=(12, 4))
spec = fig.add_gridspec(1, 2)
ax1 = fig.add_subplot(spec[0, 0]); ax1.set_title('Training')
ax2 = fig.add_subplot(spec[0, 1]); ax2.set_title('Test')
RocCurveDisplay.from_estimator(model, XTR, YTR, plot_chance_level=True, ax=ax1)
RocCurveDisplay.from_estimator(model, XTS, YTS, plot_chance_level=True, ax=ax2)
plt.suptitle("ROC Curves for " + model_name)
plt.show()

This MLP architecture can be used to classify the samples in the classical iris dataset. To see why, recall that this dataset uses as inputs four morphological characteristics of the iris flowers, while the output factor Species has three levels.
iris = datasets.load_iris()
X = iris["data"]
X_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df = pd.DataFrame(X, columns=X_names)
Y = "species"
df[Y] = iris["target"]
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
Therefore, it requires an MLP model with four inputs and three output neurons. Let us create the train/test split and the datasets and variables that we will use. Note that we are scaling the input data.
inputs = X_names
output = Y
XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[Y],
                                      test_size=0.2,   # percentage preserved as test data
                                      random_state=1,  # seed for replication
                                      stratify=df[Y])  # preserves distribution of y
iris_scaler = StandardScaler()
XTR = iris_scaler.fit_transform(XTR)
XTS = iris_scaler.transform(XTS)
dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS

Now we can define the model’s architecture and train it. Note that we only care about the number and size of the hidden layers. The classifier will look at the number of levels of the output and use the appropriate size and softmax function for the output layer.
mlp_multi = MLPClassifier(hidden_layer_sizes=(3, 2), activation="relu", max_iter=6000, random_state=2024)
mlp_multi.fit(XTR, YTR)
model = mlp_multi
model_name = "mlp_multi"

Let us save the data and model to disk.
iris_data = {'XTR':XTR, 'XTS':XTS, 'YTR':YTR, 'YTS':YTS,
             'inputs':inputs, 'output':output, 'iris_scaler':iris_scaler}
with open('iris_data.pkl', 'wb') as file:
    pickle.dump(iris_data, file)

iris_model = {'mlp_multi':mlp_multi}
with open('iris_model.pkl', 'wb') as file:
    pickle.dump(iris_model, file)

After the fit we can confirm the expected structure of the weights for this architecture:

[lyr.shape for lyr in model.coefs_]
[(4, 3), (3, 2), (2, 3)]
Now let us use the mlp_draw function to visualize the fitted network and weights (recall, bias terms are omitted).
mlp_draw(model=model, orientation="h", label_pos=1.3/7)

The score predictions of this MLP, as expected, appear as tuples of three values, the scores for each of the three iris species corresponding to the data point being classified. For example, the first flower in the test set gets these scores. The class is assigned using the highest score.
np.round(model.predict_proba(XTS)[0, :], 3)
array([0.   , 0.001, 0.999])
predicted_class = model.predict(XTS)[0]
iris["target_names"][predicted_class]
'virginica'
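Behind the scenes, a multiclass output layer turns its raw activations into scores like these with a softmax, and the class is the argmax. A minimal sketch of that final step (the input values here are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()         # scores are positive and sum to 1

scores = softmax(np.array([0.5, 1.0, 3.0]))
predicted = int(np.argmax(scores))  # class with the highest score: 2
```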
And in fact you can see that this very simple MLP model achieves almost perfect prediction on the test set.
pd.crosstab(model.predict(XTS), YTS)

| species | 0 | 1 | 2 |
|---|---|---|---|
| row_0 | |||
| 0 | 10 | 0 | 0 |
| 1 | 0 | 10 | 1 |
| 2 | 0 | 0 | 9 |
The main goal of this example is to illustrate that there is not much of a difference between binary and multiclass MLP models, so we move on to another Machine Learning task. For a regression example, we first generate a noisy one-dimensional signal:
rng = np.random.default_rng(2024)
n = 1000
X = np.linspace(start=0, stop=5, num=n)
Y = X/2 + np.sin(X) * np.cos(2 * X) + 0.3 * rng.normal(size=n)
df = pd.DataFrame({"X":X, "Y":Y})
fig, ax = plt.subplots(figsize=(8, 3))
# sns.lineplot(x = X, y = Y, ax=ax)
sns.scatterplot(data=df, x="X", y="Y", ax=ax, s=5)
plt.show()
Next we define the datasets and variables we are going to use:
inputs = ["X"]
output = "Y"
XTR, XTS, YTR, YTS = train_test_split(df[inputs], df[output],
                                      test_size=0.2,  # percentage preserved as test data
                                      random_state=1)
dfTR = pd.DataFrame(XTR, columns=inputs)
dfTR[output] = YTR
dfTS = pd.DataFrame(XTS, columns=inputs)
dfTS[output] = YTS

And we create an MLP model with two hidden layers. We are using a higher number of neurons here and we have switched to MLPRegressor. But note that again we do not need to describe the output layer in any way; sklearn uses the appropriate values for a regression problem. Note also that we are scaling the inputs to get a better behavior in the model fit, and in order to do that we use the pipeline framework we know.
mlp_reg = MLPRegressor(hidden_layer_sizes=(80, 20), activation="relu", random_state=2024, max_iter=2000)
reg_scaler = StandardScaler()
reg_scaler.set_output(transform="pandas")
mlp_reg_pipe = Pipeline(steps=[('scaler', reg_scaler),
                               ('mlp', mlp_reg)])
mlp_reg_pipe.fit(XTR, YTR)
model = mlp_reg_pipe
model_name = "mlp_reg"

Let us save the data and model to disk (in Python’s pickle format).
mlp_reg_data = {'XTR':XTR, 'XTS':XTS, 'YTR':YTR, 'YTS':YTS,
                'inputs':inputs, 'output':output, 'reg_scaler':reg_scaler}
with open('mlp_reg_data.pkl', 'wb') as file:
    pickle.dump(mlp_reg_data, file)

mlp_reg_model = {'mlp_reg':mlp_reg_pipe}
with open('mlp_reg_model.pkl', 'wb') as file:
    pickle.dump(mlp_reg_model, file)

Let us now look at the model predictions. To visualize the result and get a qualitative measure of the model fit we plot the fitted values against the original data.
# Dataset for training predictions
dfTR_eval = dfTR.copy()
# Store the actual predictions
newCol = 'Y_' + model_name + '_pred'
dfTR_eval[newCol] = model.predict(XTR)

fig, ax = plt.subplots(figsize=(8, 3))
sns.scatterplot(data=dfTR_eval, x="X", y="Y", color="r", ax=ax)
sns.scatterplot(data=dfTR_eval, x="X", y=newCol, color="b", ax=ax)
plt.show()
The same kind of plot for the test set illustrates that the model is indeed learning this nonlinear signal.
# Dataset for test predictions
dfTS_eval = dfTS.copy()
# Store the actual predictions
newCol = 'Y_' + model_name + '_pred'
dfTS_eval[newCol] = model.predict(XTS)

fig, ax = plt.subplots(figsize=(8, 3))
sns.scatterplot(data=dfTS_eval, x="X", y="Y", color="r", ax=ax)
sns.scatterplot(data=dfTS_eval, x="X", y=newCol, color="b", ax=ax)
plt.show()
Finally, a plot of the network architecture illustrates the limitations of this visualization approach for larger networks.
mlp_draw(model["mlp"], orientation="h", figsize=(16, 16), alpha=0.5, add_labels=False)

But remember that you can always explore the weights and bias terms through the model.
Training MLPs through Gradient Descent and Backpropagation
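Before digging into the code, recall the basic idea: training repeatedly adjusts each weight a small step against the gradient of the loss. A toy sketch on a one-dimensional quadratic loss (the function and learning rate here are illustrative, not the actual MLP loss):

```python
def gradient_descent(grad, w0, lr=0.1, steps=50):
    # w <- w - lr * dL/dw, repeated for a fixed number of steps
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# minimize L(w) = (w - 2)**2, whose gradient is 2*(w - 2)
w_opt = gradient_descent(lambda w: 2 * (w - 2), w0=0.0)
# w_opt converges towards the minimum at w = 2
```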
Let us emphasize this idea by playing with the code. Take the training data and the fitted model of the regression example:
YTR = mlp_reg_data["YTR"]
XTR = mlp_reg_data["XTR"]
model = mlp_reg_model["mlp_reg"]

Now let us use this data and model to get the predictions for training, which correspond to what we called \(\hat y_i\) above. We show the first ones.
YTR_pred = model.predict(XTR)
YTR_pred[:10]
array([ 0.26670396, 3.69201112, 3.6049855 , 0.37045803, 1.58530131,
1.54973299, 0.36139089, 0.0949725 , -0.18409113, 3.38742146])
These YTR_pred predictions are the result of the forward pass of the entire training set through the network, using the present weights and bias terms of the model. And now we can use them to get the value of the loss function (properly speaking, an estimate of the value of the loss function):
loss = (1/2) * ((YTR_pred - YTR)**2).mean()
loss
0.049319401195777296
But if we change the size and/or values of the inputs, creating for example a new shorter array:
Xnew = np.random.default_rng(2024).normal(size=10).reshape(-1, 1)
Xnew
array([[ 1.02885687],
[ 1.64192004],
[ 1.14671953],
[-0.97317952],
[-1.3928001 ],
[ 0.06719636],
[ 0.86135092],
[ 0.5091868 ],
[ 1.81028557],
[ 0.75084347]])
and the associated output values:
Ynew = Xnew/2 + np.sin(Xnew) * np.cos(2 * Xnew) + 0.3 * rng.normal(size=Xnew.shape)

Then we can run them through the network (forward pass) and get a prediction:
Ynew_pred = model.predict(Xnew)
Ynew_pred[:10]
array([ 0.14799548, -0.2376762 , 0.05769325, 0.31106307, 0.2932709 ,
        0.35958583, 0.27633267, 0.38818285, 0.04876662, 0.36026116])
Then we can get a new value of the loss:
loss_new = (1/2) * ((Ynew_pred - Ynew)**2).mean()
loss_new
0.09540873679602241
The neural network does not care whether these values belong to the training set, or even how many of them there are. But we can take a step further to better understand training. Below we see the 20 weights in the (fitted to training data) output layer:
outer_weights = model["mlp"].coefs_[2]
outer_weights
array([[-5.25548563e-01],
[-1.82610993e-01],
[-2.07258451e-02],
[ 5.01400632e-01],
[-2.10009181e-01],
[ 1.33174938e-02],
[-4.89055375e-01],
[ 2.01155719e-02],
[ 2.95700964e-01],
[-1.40096016e-01],
[-2.77540400e-01],
[ 1.99960980e-09],
[-3.39275348e-07],
[-3.11143961e-01],
[ 4.89647944e-01],
[ 5.74968143e-01],
[-4.48779105e-01],
[ 4.07173610e-01],
[ 1.26540951e-06],
[-4.77446082e-01]])
Let us modify them, replacing them with arbitrary uniform values:
model["mlp"].coefs_[2] = np.random.default_rng(2024).uniform(low=0, high=0.1, size=20).reshape(-1, 1)
model["mlp"].coefs_[2]
array([[0.06758313],
[0.02143232],
[0.0309452 ],
[0.07994661],
[0.09958021],
[0.01422318],
[0.00787255],
[0.01808238],
[0.03596469],
[0.01696192],
[0.05887593],
[0.06168075],
[0.01053857],
[0.05657311],
[0.00046296],
[0.04651192],
[0.09756222],
[0.07994284],
[0.05968224],
[0.03253497]])
Now we have modified the neural network itself. In fact, in this case we have untrained it (effectively ruining it). To see this, let us run the training set again through this altered neural network to get the predictions. You can compare them with the predictions of the original, untampered fitted network.
YTR_pred = model.predict(XTR)
YTR_pred[:10]
array([0.39286751, 0.56660871, 0.55717158, 0.47475645, 0.3529616 ,
0.34419725, 0.48586723, 0.39680396, 0.4032007 , 0.53357875])
Accordingly, the loss function value is different and much worse!
loss = (1/2) * ((YTR_pred - YTR)**2).mean()
loss
0.9872144488030444
Before we continue let us restore the model to its pristine fitted state:
model["mlp"].coefs_[2] = outer_weights

Let us use the models that we have fitted to illustrate the role played by the loss function. For example, these are the graphs of the loss function values for our three fitted models as the training of each neural network proceeds:
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(1, 3, figsize=(16, 3))
    axs[0].set_title("Binary Classification (make_circles) Loss curve")
    sns.lineplot(make_circles_model['mlp_binary'].loss_curve_, ax=axs[0])
    axs[1].set_title("Multilevel Classification (iris) Loss curve")
    sns.lineplot(iris_model['mlp_multi'].loss_curve_, ax=axs[1])
    axs[2].set_title("Regression Loss curve")
    sns.lineplot(mlp_reg_model['mlp_reg']['mlp'].loss_curve_, ax=axs[2])

The three pictures illustrate that the fitting process proceeds by searching for smaller values of the loss.
alphas = [10**k for k in range(-5, 5, 2)]
loss_curves = []
test_scores = []
for i in range(len(alphas)):
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), alpha=alphas[i],
                                 activation="relu", random_state=2024, max_iter=2000)
    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")
    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch),
                                         ('mlp', mlp_reg_batch)])
    mlp_reg_batch_pipe.fit(XTR, YTR)
    test_scores.append(mlp_reg_batch_pipe.score(XTS, YTS))
    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(alphas), 1, figsize=(6, 12))
    for i in range(len(alphas)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for regularization alpha = {alphas[i]}\n
        Final loss {np.round(loss_curves[i][-1], 3)},
        Test Score {np.round(test_scores[i], 3)}""")
    fig.tight_layout()
In scikit-learn we can control this with the batch_size argument of both MLPClassifier and MLPRegressor.
batch_sizes = [2, 64, 256, XTR.shape[0]]
loss_curves = []
for i in range(len(batch_sizes)):
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), batch_size=batch_sizes[i],
                                 activation="relu", random_state=2024, max_iter=2000)
    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")
    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch),
                                         ('mlp', mlp_reg_batch)])
    mlp_reg_batch_pipe.fit(XTR, YTR)
    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(1, len(batch_sizes), figsize=(16, 3))
    for i in range(len(batch_sizes)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"Loss curve for batch size {batch_sizes[i]}\n Final loss {np.round(loss_curves[i][-1], 3)}")
The results show that, all other things being equal, using the whole training set as a single batch (last loss curve) is not a good idea: it fails to reach the minimum. Note also that the first curve is much noisier, because with such small batches the estimates of the gradient are much more unstable.
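To make the mechanism concrete, here is a sketch of how mini-batches are typically drawn in each epoch (an illustration of the idea, not sklearn's internal code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 10, 4
idx = rng.permutation(n)  # shuffle the sample indices once per epoch
batches = [idx[i:i + batch_size] for i in range(0, n, batch_size)]
# three batches of sizes 4, 4 and 2; the gradient is estimated on each in turn
```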
learning_rates = [0.00001, 0.005, 0.01, 0.05, 0.1, 0.5]
loss_curves = []
train_scores = []
for i in range(len(learning_rates)):
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), learning_rate='constant', solver='sgd',
                                 learning_rate_init=learning_rates[i], momentum=0, activation="relu",
                                 random_state=2024, max_iter=2000)
    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")
    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch),
                                         ('mlp', mlp_reg_batch)])
    mlp_reg_batch_pipe.fit(XTR, YTR)
    train_scores.append(cross_val_score(mlp_reg_batch_pipe, XTR, YTR, scoring='neg_mean_squared_error').mean())
    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)
/Users/fernando/miniconda3/envs/MLMIC24/lib/python3.12/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (2000) reached and the optimization hasn't converged yet.
  warnings.warn(
Now let us see the loss curves and validation scores of the different learning rates.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(learning_rates), 1, figsize=(6, 12))
    for i in range(len(learning_rates)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for learning rate = {learning_rates[i]}\n
        Final loss {np.round(loss_curves[i][-1], 3)},
        Training Score (negMSE) {np.round(train_scores[i], 3)}""")
    fig.tight_layout()
The result of the experiment confirms what we said before. The smallest learning rate gives a very poor result: it fails to reach any interesting loss value after 2000 iterations, because we are simply moving too slowly. At the other end of the experiment, the largest learning rate produces a bouncing loss curve that never settles into a good fit. Meanwhile, some intermediate values of the learning rate seem to be doing much better.
Let us now bring back momentum; it is on by default, so we only need to remove the momentum=0 option and rerun the code.
learning_rates = [0.00001, 0.005, 0.01, 0.05, 0.1, 0.5]
loss_curves = []
train_scores = []
for i in range(len(learning_rates)):
    # print(learning_rates[i])
    mlp_reg_batch = MLPRegressor(hidden_layer_sizes=(80, 20), learning_rate='constant', solver='sgd',
                                 learning_rate_init=learning_rates[i], activation="relu", random_state=2024, max_iter=2000)
    reg_scaler_batch = StandardScaler()
    reg_scaler_batch.set_output(transform="pandas")
    mlp_reg_batch_pipe = Pipeline(steps=[('scaler', reg_scaler_batch),
                                         ('mlp', mlp_reg_batch)])
    mlp_reg_batch_pipe.fit(XTR, YTR)
    train_scores.append(cross_val_score(mlp_reg_batch_pipe, XTR, YTR, scoring='neg_mean_squared_error').mean())
    model_batch = mlp_reg_batch_pipe
    loss_curves.append(model_batch['mlp'].loss_curve_)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fig, axs = plt.subplots(len(learning_rates), 1, figsize=(6, 12))
    for i in range(len(learning_rates)):
        sns.lineplot(loss_curves[i], ax=axs[i])
        axs[i].set_title(f"""Loss curve for learning rate = {learning_rates[i]}\n
        Final loss {np.round(loss_curves[i][-1], 3)},
        Training Score (negMSE) {np.round(train_scores[i], 3)}""")
    fig.tight_layout()
The impact of momentum is a global improvement of the behavior across all learning rates. The ordering is preserved in this case: the best learning rates are still the intermediate ones, but the convergence of the method improves and we get better final values of the loss.
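To make the role of momentum concrete, here is a minimal sketch of the update rule behind it (not sklearn's internal code): a velocity term accumulates an exponentially decaying average of past gradients, which damps oscillations and speeds up progress along consistent descent directions. The toy quadratic loss, step counts, and hyperparameter values below are illustrative assumptions; sklearn's `MLPRegressor` uses `momentum=0.9` by default.

```python
def sgd_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # velocity remembers a decaying sum of past gradients;
    # momentum=0 recovers plain gradient descent.
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Minimize the toy loss f(w) = w**2 (gradient 2*w) starting from w = 5.0
w_plain, w_mom, v = 5.0, 5.0, 0.0
for _ in range(200):
    w_plain, _ = sgd_step(w_plain, 2 * w_plain, 0.0, momentum=0.0)
    w_mom, v = sgd_step(w_mom, 2 * w_mom, v)

# With the same learning rate and number of steps, the momentum
# iterate ends up much closer to the minimum at w = 0.
print(abs(w_plain), abs(w_mom))
```

With the same 200 steps and learning rate, plain gradient descent is still visibly far from the minimum while the momentum iterate is essentially there, which matches the behavior of the loss curves above.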
To understand the notation, consider the picture below. We closely follow the notation in Chapter 10 of (James et al. 2023):
- The weights in the \(r\)-th hidden layer are called \(w^{(r)}_{ij}\). The subscripts \(i\) and \(j\) indicate the numbers of the neurons they connect.
- For the special case of the input layer, \(i\) indicates the input number. Thus, the weight \(w^{(1)}_{04}\) we are considering connects the first input \(X_{i0}\) to the fourth neuron in the first hidden layer.
- The bias term of the \(j\)-th neuron in the \(r\)-th hidden layer is denoted by \(w^{(r)}_{Bj}\).
- The \(j\)-th neuron in the \(r\)-th layer computes the activation \[A^{(r)}_{j} = h^{(r)}(z^{(r)}_{j}) = h^{(r)}\left(w^{(r)}_{Bj}
+\underbrace{w^{(r)}_{0j}A^{(r-1)}_{0} + w^{(r)}_{1j}A^{(r-1)}_{1} + \cdots}_{\text{sum over all neurons of the preceding layer}}
\right)\] where \(h^{(r)}\) is the common activation function for the \(r\)-th hidden layer (e.g. relu). For the special case of the first hidden layer, the activations \(A^{(r-1)}_{k}\) must be replaced with the \(k\)-th input value.
- For the output layer we use the superscript \((out)\). Thus \(h^{(out)}\) is the output layer activation function (usually the identity in regression), and \(w^{(out)}_{rs}\) indicates a weight in the output layer. In regression there is only one neuron in the output layer, so in principle the second subscript could be omitted, but we keep it for greater generality.
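As a sanity check on the notation, here is a small sketch of the forward pass in numpy. The layer sizes (2 inputs, hidden layers of 8 and 6 neurons, one output) and the random weights are assumptions chosen to match the picture; `W1[k, j]` plays the role of \(w^{(1)}_{kj}\) and `b1[j]` the role of \(w^{(1)}_{Bj}\).

```python
import numpy as np

rng = np.random.default_rng(2024)
relu = lambda z: np.maximum(z, 0.0)

# Assumed layer sizes: 2 inputs -> 8 -> 6 -> 1 output.
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)    # w^{(1)}_{kj}, w^{(1)}_{Bj}
W2, b2 = rng.normal(size=(8, 6)), rng.normal(size=6)    # w^{(2)}_{kj}, w^{(2)}_{Bj}
Wout, bout = rng.normal(size=(6, 1)), rng.normal(size=1)

X = np.array([0.5, -1.2])        # one sample: inputs X_{i0}, X_{i1}
A1 = relu(X @ W1 + b1)           # A^{(1)}_j = h^{(1)}(z^{(1)}_j), j = 0..7
A2 = relu(A1 @ W2 + b2)          # A^{(2)}_j = h^{(2)}(z^{(2)}_j), j = 0..5
y_hat = (A2 @ Wout + bout)[0]    # identity output activation (regression)
print(A1.shape, A2.shape)        # (8,) (6,)
```

Each matrix product computes all the affine combinations \(z^{(r)}_j\) of a layer at once, which is why the notation vectorizes so naturally.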
What do the red and blue edges represent? In this regression problem, where the loss is \[{\cal L}_{MSE} = \dfrac{1}{2n}\sum_{i = 1}^n (\hat y_i - y_i)^2\] the predicted value \(\hat y_i\) is computed using the activations in the second hidden layer and the weights \(w^{(out)}_{rs}\) of the output layer. These weights are the rightmost group of colored edges.
But then we must recall that \(\hat y_i\) has been obtained from the activations of the second hidden layer, and each neuron of the second hidden layer is computed from activations coming from the first hidden layer. The second group of red edges identifies how the activation \(A^{(1)}_{4}\) (first hidden layer, neuron number 4) is used as input in each of the neurons of the second hidden layer. And we care about neuron number 4 because the (green edge) weight \(w^{(1)}_{04}\) is one of the inputs of that neuron (and only that neuron).
To apply the chain rule we need to think of all the composition paths that connect the loss function value to \(w^{(1)}_{04}\). Now you can see that this is what the red edges represent. We must compute a product of derivatives along each one of those paths and then add up the results. The blue colored path is a particular example of one such composition path. Let us follow the chain of derivatives along this blue path, in the backward direction of the algorithm.
- We can begin by computing \[
\dfrac{\partial \cal L}{\partial \hat y_i} = \dfrac{1}{n} (\hat y_i - y_i)
\] Recall that \(\hat y_i\) is the prediction that the network outputs for the input values \((X_{i0}, X_{i1})\).
Now \[\hat y_i = h^{(out)}(z_{out}) = h^{(out)}\left(
w^{(out)}_{B} + w^{(out)}_{0}A^{(2)}_0 + w^{(out)}_{1}A^{(2)}_1 +\cdots +
\color{blue}w^{(out)}_{3}A^{(2)}_3\color{black} + \cdots + w^{(out)}_{5}A^{(2)}_5
\right)\] and so the next derivative in the blue chain is: \[\dfrac{\partial \hat y_i}{\partial A^{(2)}_3} = (h^{(out)})'(z_{out})\cdot w^{(out)}_{3}\] Now we have reached \(A^{(2)}_3\) with \[
A^{(2)}_3 = h^{(2)}(z^{(2)}_3) = h^{(2)}\left(
w^{(2)}_{B3} + w^{(2)}_{03}A^{(1)}_{0} + w^{(2)}_{13}A^{(1)}_{1} +\cdots +
\color{blue}w^{(2)}_{43}A^{(1)}_4\color{black} + \cdots + w^{(2)}_{73}A^{(1)}_7
\right)
\] The next derivative in our path is therefore: \[\dfrac{\partial A^{(2)}_3}{\partial A^{(1)}_4} = (h^{(2)})'(z^{(2)}_3)\cdot w^{(2)}_{43}\] Finally (in this really simple neural network) we reach \(A^{(1)}_4\) with \[
A^{(1)}_4 = h^{(1)}(z^{(1)}_4) = h^{(1)}\left(
w^{(1)}_{B4} + \color{green}w^{(1)}_{04}X_{i0}\color{black} + w^{(1)}_{14}X_{i1}
\right)
\]
The green color highlights that we have reached the desired weight \(w^{(1)}_{04}\) and so the final derivative in the chain is: \[\dfrac{\partial A^{(1)}_4}{\partial w^{(1)}_{04}} = (h^{(1)})'(z^{(1)}_4)\cdot X_{i0}\] Putting all the pieces together, the product of derivatives along the blue path is: \[\dfrac{\partial \cal L}{\partial \hat y_i}\cdot(h^{(out)})'(z_{out})\cdot w^{(out)}_{3}\cdot
(h^{(2)})'(z^{(2)}_3)\cdot w^{(2)}_{43}\cdot (h^{(1)})'(z^{(1)}_4)\cdot X_{i0}\] The derivative \[\dfrac{\partial \cal L}{\partial w^{(1)}_{04}}\] is a sum of six products like this, one for each red or blue path in the picture. Remember that the forward pass has provided us with all the weights and all the affine combinations \(z^{(r)}_j\). In particular, we know all the values that appear in the above product, and so we can compute the gradient.
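We can check this sum-over-paths computation numerically. The sketch below builds a hypothetical tiny network matching the picture (2 inputs, hidden layers of 8 and 6 relu neurons, identity output, a single sample so \(n = 1\)), computes \(\partial{\cal L}/\partial w^{(1)}_{04}\) as the sum of the six path products described above, and compares it to a finite-difference estimate. All the weights and the sample values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)    # derivative of relu

# Hypothetical network: 2 inputs -> 8 -> 6 -> 1, identity output, one sample.
W1, b1 = rng.normal(size=(2, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 6)), rng.normal(size=6)
Wout, bout = rng.normal(size=(6, 1)), rng.normal(size=1)
X, y = np.array([0.5, -1.2]), 1.0

def loss(W1):
    A1 = relu(X @ W1 + b1)
    A2 = relu(A1 @ W2 + b2)
    y_hat = (A2 @ Wout + bout)[0]
    return 0.5 * (y_hat - y) ** 2          # 1/(2n) with n = 1

# Forward pass: store the affine combinations z along with the activations.
z1 = X @ W1 + b1;  A1 = relu(z1)
z2 = A1 @ W2 + b2; A2 = relu(z2)
y_hat = (A2 @ Wout + bout)[0]

# dL/dw^{(1)}_{04}: sum over the six paths through the second hidden
# layer, each contributing the product of derivatives from the text.
grad = sum((y_hat - y) * Wout[j, 0] * drelu(z2[j]) * W2[4, j]
           * drelu(z1[4]) * X[0] for j in range(6))

# Central finite-difference check on w^{(1)}_{04} = W1[0, 4].
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 4] += eps
W1m[0, 4] -= eps
fd = (loss(W1p) - loss(W1m)) / (2 * eps)
print(grad, fd)
```

The two numbers agree up to finite-difference error, which is exactly what backpropagation guarantees: the path products computed from stored forward-pass values reproduce the true gradient.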
The above construction of the gradient may seem intimidating at first, but it should be clear that we can use matrix computations to organize the chain rule and parallelize it during computation. In many cases we find it convenient to think in terms of higher dimensional stacks of arrays, so called tensors. That is the origin of the name of one of the first and most popular deep learning libraries, tensorflow.
The graph below, from a video at 3Blue1Brown, conveys the idea of the backpropagation algorithm as a backward flow of tensor information used to update the gradient.